by Inessa Prokofyeva
========================================================
This report explores white wine quality using a dataset whose variables quantify the chemical properties of each wine, alongside wine experts’ ratings.
The dataset is taken from P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.
The dataset contains 4898 observations of 13 variables: 11 chemical characteristics, an ID stored in the first variable “X”, and “quality”, the experts’ rating of the wine.
## [1] 4898 13
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700
## Median :2450 Median : 6.800 Median :0.2600 Median :0.3200
## Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342
## 3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 2.00
## 1st Qu.: 1.700 1st Qu.:0.03600 1st Qu.: 23.00
## Median : 5.200 Median :0.04300 Median : 34.00
## Mean : 6.391 Mean :0.04577 Mean : 35.31
## 3rd Qu.: 9.900 3rd Qu.:0.05000 3rd Qu.: 46.00
## Max. :65.800 Max. :0.34600 Max. :289.00
## total.sulfur.dioxide density pH sulphates
## Min. : 9.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.:108.0 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100
## Median :134.0 Median :0.9937 Median :3.180 Median :0.4700
## Mean :138.4 Mean :0.9940 Mean :3.188 Mean :0.4898
## 3rd Qu.:167.0 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500
## Max. :440.0 Max. :1.0390 Max. :3.820 Max. :1.0800
## alcohol quality
## Min. : 8.00 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.40 Median :6.000
## Mean :10.51 Mean :5.878
## 3rd Qu.:11.40 3rd Qu.:6.000
## Max. :14.20 Max. :9.000
The dataset contains three measures of wine acidity (fixed.acidity, volatile.acidity and citric.acid), but it is usually the overall acidity that affects a wine’s character, so I’ve combined them into one variable, “total.acidity”. I’ve also added a qualitative variable, “rate”, derived from the experts’ “quality” value.
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality" "rate" "total.acidity"
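The two derived variables could be reproduced roughly as follows. Taking total.acidity as the sum of the three acid columns is consistent with the summaries in this report (for example, the means 6.855 + 0.2782 + 0.3342 give 7.467). The break points used for rate are an assumption: the report only shows that quality 3 maps to “below average” and quality 9 to “very good”. The small data frame `ww` is a toy stand-in for the real file.

```r
# Toy stand-in for the wine data frame (illustrative rows, not the full dataset)
ww <- data.frame(
  fixed.acidity    = c(6.1, 7.0, 6.6, 9.1),
  volatile.acidity = c(0.26, 0.27, 0.16, 0.27),
  citric.acid      = c(0.25, 0.36, 0.40, 0.45),
  quality          = c(3, 6, 7, 9)
)

# Total acidity as the sum of the three acid measurements
ww$total.acidity <- with(ww, fixed.acidity + volatile.acidity + citric.acid)

# Qualitative rating. The break points are ASSUMED, not taken from the report:
# (0,4] below average, (4,6] average, (6,8] good, (8,10] very good
ww$rate <- cut(
  ww$quality,
  breaks = c(0, 4, 6, 8, 10),
  labels = c("below average", "average", "good", "very good"),
  ordered_result = TRUE
)

print(ww[, c("quality", "rate", "total.acidity")])
```

Using an ordered factor keeps the rating levels in a sensible order on plot axes and in tables.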
The dataset contains 4898 observations with 13 quantitative variables. Plotting a histogram of each distribution is an informative way to start the investigation.
Acid level is very important in winemaking. Too little acid and your wine tastes flabby and non-committal. Too much acid and your wine will taste like vinegar. Acids must be properly countered with other ingredients in wine to be “in balance”. Let’s look at the distributions of acids in wine.
The distributions of fixed.acidity and citric.acid look close to normal, but volatile.acidity has a long right tail. Applying the log10() function brings its shape close to normal.
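The effect of a log10 transformation on a long-tailed variable can be sketched with simulated data. The values below are drawn from a log-normal distribution and only stand in for volatile.acidity; the helper `skewness` is a hypothetical name defined here, not a base R function.

```r
set.seed(1)

# Simulated right-skewed values standing in for volatile.acidity (NOT the real data)
x <- rlnorm(1000, meanlog = log(0.26), sdlog = 0.35)

# Simple moment-based skewness; values near 0 indicate a symmetric distribution
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

skew_raw <- skewness(x)
skew_log <- skewness(log10(x))
print(c(raw = skew_raw, log10 = skew_log))

# Side-by-side histograms on the raw and log10 scales
op <- par(mfrow = c(1, 2))
hist(x, breaks = 30, main = "raw", xlab = "volatile.acidity")
hist(log10(x), breaks = 30, main = "log10", xlab = "log10(volatile.acidity)")
par(op)
```

A skewness close to zero after the transformation is one quick numerical check that the histogram has become roughly symmetric.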
The plot for total.acidity looks very smooth and close to normal. As mentioned above, winemakers try to keep acidity at a certain level, but some wines have very high acid levels. The majority of values are concentrated between 6 and 8, which are the recommended values for dry and sweet white wines. The minimum value is 4.130, the maximum is 14.960, and the median is 7.405.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.130 6.890 7.405 7.467 7.960 14.960
Later we’ll investigate the influence of wine acidity on quality.
Different types of wine contain different amounts of sugar. The distribution of residual sugar looks positively skewed.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.700 5.200 6.391 9.900 65.800
Applying a log10() transformation gives a better view of the distribution of residual sugar. The transformed graph appears bimodal, with peaks around 1.5 and 9. That can be explained by different types of wine: dry and off-dry. Most values lie between 1 and 2.
According to the dataset description, it is rare to find wines with less than 1 gram/liter of sugar or with more than 45 grams/liter. The dataset contains 77 observations with residual.sugar < 1 and only 1 with residual.sugar >= 45.
Less than 1 gram/liter:
##
## FALSE TRUE
## 4821 77
## residual.sugar quality
## 55 0.9 6
## 173 0.8 4
## 210 0.9 6
## 224 0.8 6
## 260 0.9 4
## 302 0.9 6
More than 45 grams/liter:
##
## FALSE TRUE
## 4897 1
## residual.sugar quality
## 2782 65.8 6
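Counts and listings like the ones above can be produced by tabulating a logical condition and then subsetting on it. This is a minimal sketch on a toy data frame; the rows are illustrative and the names `low_sugar` and `high_sugar` are hypothetical.

```r
# Toy stand-in for the wine data (illustrative rows only)
ww <- data.frame(
  residual.sugar = c(0.9, 1.6, 5.2, 65.8, 0.8, 10.6),
  quality        = c(6, 6, 5, 6, 4, 9)
)

# A logical test tabulated into FALSE/TRUE counts
low_sugar <- table(ww$residual.sugar < 1)
print(low_sugar)

# Rows at the other extreme, keeping only the columns of interest
high_sugar <- subset(ww, residual.sugar >= 45,
                     select = c(residual.sugar, quality))
print(high_sugar)
```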
Most wines have an alcohol content between 9% and 12%, with a median of 10.4%. Strong wines are uncommon: the higher the alcohol level, the fewer observations there are. I don’t expect alcohol level to affect quality much, but it may affect other characteristics.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.40 10.51 11.40 14.20
According to Alison Crowe of Winemaker Magazine, “pH is the backbone of a wine”. The pH level affects many characteristics of a wine and helps keep its taste in balance. In winemaking practice, pH < 3.3 is recommended for white wines.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.090 3.180 3.188 3.280 3.820
The distribution looks close to normal, and indeed the majority of values are concentrated between pH 3.0 and 3.3. There are also wines with higher or lower pH: the minimum is 2.72 and the maximum is 3.82.
Today the use of sulfur dioxide is widely accepted as a useful winemaking aid. It is used as a preservative because of its anti-oxidative and anti-microbial properties in wine, but also as a cleaning agent for barrels and winery facilities.
Looking at the total.sulfur.dioxide distribution, the majority of values are concentrated between 100 and 150, but there are also wines with a very high level of SO2, which makes their overall sulfite level quite high. That can be dangerous for people with sulfite allergies. Adding no sulfites at all is risky, but a high level of SO2 can also say something about the quality of a wine. Natural winemakers say they try to use as little SO2 as possible, and the dataset does contain wines with very low sulfite levels.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 108.0 134.0 138.4 167.0 440.0
## X fixed.acidity volatile.acidity citric.acid residual.sugar
## 4746 4746 6.1 0.26 0.25 2.9
## chlorides free.sulfur.dioxide total.sulfur.dioxide density pH
## 4746 0.047 289 440 0.99314 3.44
## sulphates alcohol quality rate total.acidity
## 4746 0.64 10.5 3 below average 6.61
This dataset has only one measure of quality: the experts’ rating. Plotting the count of wines at each level makes it obvious that most wines sit at the median value of 6. The minimum rating is 3, and no wine received the top rating of 10 (the maximum is 9).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.878 6.000 9.000
Rate > 8
##
## FALSE TRUE
## 4893 5
Rate < 4
##
## FALSE TRUE
## 4878 20
To describe wine quality without numbers, I’ve added the qualitative variable “rate”, which simply says whether a wine is good or not so good; a bare number can be confusing when the reader doesn’t know the scale. There are a lot of average wines, but the dataset still contains some very good ones.
## total.acidity residual.sugar pH sulphates alcohol quality rate
## 775 9.82 10.6 3.20 0.46 10.4 9 very good
## 821 7.25 1.6 3.41 0.61 12.4 9 very good
## 828 8.00 2.0 3.28 0.48 12.5 9 very good
## 877 7.60 4.2 3.28 0.36 12.7 9 very good
## 1606 7.85 2.2 3.37 0.42 12.9 9 very good
The dataset contains 4898 observations with 11 quantitative chemical characteristics and 1 quantitative experts’ rating. To complete the dataset description I’ve added 2 variables: the quantitative total.acidity and the qualitative rate.
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality" "rate" "total.acidity"
I think the most interesting distributions are those for acidity and pH. To bring the total.acidity distribution closer to normal I’ve used a log10 transformation. For pH, even without any transformation, the graph is very close to an ideal normal distribution. That says a lot about winemakers: they try to keep all these values within a certain range.
The distribution of residual.sugar is also intriguing. It has two peaks, which can be explained by different types of wine. Many people think that the less sugar a wine has the better it is, so I’ll try to find the correlation between these variables.
Of course the main feature of interest is quality. All the chemical characteristics can influence each other, so it’s very difficult to say with confidence which features make the best wine. Still, there are some correlations, investigated later in this report, that can tell us how these variables influence wine quality.
To guide the investigation it helps to read winemakers’ guides and study tables of how the ingredients in wine depend on each other. According to these guides (links are below), there should be correlations between pH and acid levels, between acids and alcohol, and between sulfite level and quality. We’ll find out.
As discussed earlier, there are many questions about how wine characteristics influence each other and, especially, quality. This section explores these dependencies.
A scatterplot matrix is a good starting point for further investigation. Since some variables in the dataset describe the same characteristic, only a subset of them is plotted.
The strongest correlations are between residual.sugar and density and between alcohol and density. That is to be expected, since density is largely determined by the balance of sugar, which raises it, and alcohol, which lowers it.
##
## Pearson's product-moment correlation
##
## data: ww$residual.sugar and ww$density
## t = 107.87, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8304732 0.8470698
## sample estimates:
## cor
## 0.8389665
##
## Pearson's product-moment correlation
##
## data: ww$alcohol and ww$density
## t = -87.255, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.7908646 -0.7689315
## sample estimates:
## cor
## -0.7801376
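Output of this shape comes from `cor.test()`, which reports the Pearson estimate together with a t-test and a 95% confidence interval. The sketch below runs it on simulated columns; the coefficients used to generate `density` are invented for illustration and are not estimates from the wine data.

```r
set.seed(42)
n <- 200

# Simulated columns (NOT the real data): density rises with sugar, falls with alcohol
residual.sugar <- runif(n, 1, 20)
alcohol        <- runif(n, 8, 14)
density <- 0.99 + 0.0004 * residual.sugar - 0.001 * alcohol +
  rnorm(n, sd = 0.0005)
ww <- data.frame(residual.sugar, alcohol, density)

# Pearson correlation with t statistic, p-value and confidence interval
ct_sugar   <- cor.test(ww$residual.sugar, ww$density)
ct_alcohol <- cor.test(ww$alcohol, ww$density)

print(ct_sugar$estimate)   # sample correlation
print(ct_sugar$conf.int)   # 95 percent confidence interval
print(ct_alcohol$estimate)
```

The returned object is a list, so individual pieces such as `estimate`, `conf.int` and `p.value` can be pulled out instead of printing the whole test.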
Density is an interesting characteristic because it correlates with several others. The graphs below show the correlations between density and other variables. The relationships are not very strong but they do exist (except for density and pH). These plots describe how changes in each characteristic affect density.
density and total.acidity
## [1] 0.2756088
density and chlorides
## [1] 0.2572113
density and pH
## [1] -0.09359149
density and total.sulfur.dioxide
## [1] 0.5298813
Density tends to grow as total.acidity grows. The correlation is 0.2756088, so the trend is weak but present.
It’s interesting how the other characteristics affect each other, but the main question is: what affects wine quality? To select a wine of perfect quality we should know how to identify it, or at least which characteristics to look for. The boxplots below show the typical values for wines of different rates.
According to these plots, density tends to decrease for better quality wines and to have a smaller range. As expected, the widest range is for average quality.
Quality: “below average”
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9917 0.9937 0.9940 0.9961 1.0390
Quality: “average”
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9917 0.9937 0.9940 0.9961 1.0390
Quality: “good”
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9917 0.9937 0.9940 0.9961 1.0390
Quality: “very good”
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9917 0.9937 0.9940 0.9961 1.0390
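The four summaries above are identical to the whole-dataset summary of density, which suggests they were computed without subsetting by rate. A per-level summary could be sketched with `tapply()`, which applies a function within each level of a factor. The data frame below is a toy example with invented density values, so its output says nothing about the real wines.

```r
# Toy data (invented values): two wines per rating level
ww <- data.frame(
  density = c(0.9990, 0.9965, 0.9951, 0.9940, 0.9928, 0.9912, 0.9903, 0.9897),
  rate = factor(
    rep(c("below average", "average", "good", "very good"), each = 2),
    levels = c("below average", "average", "good", "very good")
  )
)

# One six-number summary per rating level; the subsetting happens inside tapply
by_rate <- tapply(ww$density, ww$rate, summary)
print(by_rate)

# Medians alone, convenient for comparing levels side by side
med <- tapply(ww$density, ww$rate, median)
print(med)
```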
We see a similar picture for alcohol, but in the opposite direction: better quality wines have a higher percentage of alcohol.
The range of sugar content is smaller for very good wines than for the others. The biggest range is for average wines, which is predictable. Still, there is one outlier among the best wines, which shows that sugar level can’t be the only factor in quality. So finding a bottle of sugar-free wine in the store won’t guarantee its quality.
Median SO2 levels are close for wines with different rates. But very good wines have far fewer outliers and a smaller range than the others. As we know, this characteristic is very important in winemaking; the graph suggests that refined techniques help winemakers achieve the best quality without adding extra SO2.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 108.0 134.0 138.4 167.0 440.0
As mentioned before, pH is the core of wine quality. Here we see that the pH of almost all wines is kept between 3.0 and 3.3; however, the better wines have a higher pH and, unlike the other wines, no outliers. pH is an indicator that all the ingredients and flavors in a wine are in balance.
I never thought that alcohol level could be significant for drinks, and especially for wines. But as we can see here, the alcohol level of very good wines is much higher than that of the others. The alcohol median and range are about the same for bad and average wines. According to the graphs, the higher the alcohol level, the better the quality. I don’t think it’s a rule, just a curious observation, and there is an exception to it (the outlier on the graph).
From the previous observations, the median values of the main characteristics are quite close for wines with different rates, but for very good wines the range is much smaller than for the others. For example, total sulfur dioxide (SO2) is kept between 108 and 167 mg/dm^3, the interquartile range in the summary above. This level of SO2 will probably keep barrels free of harmful bacteria without affecting fermentation. For other wines the range of this value is large, and we can’t say for sure what makes winemakers add so much SO2.
Alcohol level can also help determine wine quality. The dataset is not very big and only a few wines are rated “very good”, but they are stronger than the others. The same goes for sugar level: a wine with less sugar is more likely to be of better quality. As winemaking guides say, all defects can be fixed by adding sugar, but for good quality wines that isn’t necessary.
The characteristic most affected by the others is density. We can see this on the graphs and in the correlation values. That makes sense because these variables describe the same features from different angles, so the highest correlation values involve density.
The strongest relationships were found between residual.sugar and density and between alcohol and density. As explained before, that’s because they’re linked physically: density depends on the balance of residual sugar and alcohol. Besides sugar and alcohol, density is also related to total.sulfur.dioxide, with a correlation of 0.53; the graph of this dependency can be seen above.
The previous section described the relationships between pairs of wine characteristics. Now I’m interested in how they influence quality together.
As we saw before, better quality wines tend to have a higher alcohol level. The widest range of acidity is observed for average wines; better wines show a smaller range of acidity and a higher level of alcohol.
Let’s look at the similar picture but with residual sugar:
This chart shows the trend that the higher the sugar level, the stronger the wine. Most values are concentrated close to 0. For better wines there is no such concentration of dots; the values are spread across all sugar levels. But the tendency in alcohol holds for all types of wine: a lower sugar level corresponds to a lower level of alcohol (light blue dots at the bottom).
The next chart shows the relationship between acidity and free.sulfur.dioxide. The range of SO2 values is wide for all wines except ‘below average’, whose data are concentrated at the bottom of the graph. For average and good wines the SO2 level tends to be higher than for bad wines.
It was shown earlier that density increases as acidity increases. The next plot shows the values of these two characteristics for wines of each quality level. Most high quality wines fall under the line.
Finally I’ll plot the dependency between density and residual.sugar. On this graph we can see layers: the values for average and bad wines are placed under the values for good wines. Wines with higher density contain more sugar. The trend is not very clear but it is present.
The strongest correlations with density were found for residual.sugar, alcohol, total.sulfur.dioxide, chlorides and total.acidity. So here is a model predicting density from these variables.
##
## Calls:
## m1: lm(formula = density ~ residual.sugar, data = ww)
## m2: lm(formula = density ~ residual.sugar + alcohol, data = ww)
## m3: lm(formula = density ~ residual.sugar + alcohol + total.sulfur.dioxide,
## data = ww)
## m4: lm(formula = density ~ residual.sugar + alcohol + total.sulfur.dioxide +
## chlorides, data = ww)
## m5: lm(formula = density ~ residual.sugar + alcohol + total.sulfur.dioxide +
## chlorides + total.acidity, data = ww)
##
## ===============================================================================
## m1 m2 m3 m4 m5
## -------------------------------------------------------------------------------
## (Intercept) 0.991*** 1.005*** 1.003*** 1.003*** 0.999***
## (0.000) (0.000) (0.000) (0.000) (0.000)
## residual.sugar 0.000*** 0.000*** 0.000*** 0.000*** 0.000***
## (0.000) (0.000) (0.000) (0.000) (0.000)
## alcohol -0.001*** -0.001*** -0.001*** -0.001***
## (0.000) (0.000) (0.000) (0.000)
## total.sulfur.dioxide 0.000*** 0.000*** 0.000***
## (0.000) (0.000) (0.000)
## chlorides 0.003*** 0.003***
## (0.001) (0.001)
## total.acidity 0.001***
## (0.000)
## -------------------------------------------------------------------------------
## R-squared 0.7 0.9 0.9 0.9 0.9
## adj. R-squared 0.7 0.9 0.9 0.9 0.9
## sigma 0.0 0.0 0.0 0.0 0.0
## F 11637.0 23791.1 16738.6 12603.4 13854.3
## p 0.0 0.0 0.0 0.0 0.0
## Log-likelihood 24498.9 27328.0 27448.4 27457.6 28176.6
## Deviance 0.0 0.0 0.0 0.0 0.0
## AIC -48991.7 -54648.0 -54886.8 -54903.3 -56339.2
## BIC -48972.3 -54622.1 -54854.3 -54864.3 -56293.7
## N 4898 4898 4898 4898 4898
## ===============================================================================
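A comparison of nested linear models like the one tabulated above can be sketched in base R. Note that the table rounds to three decimals, so small slopes such as the residual.sugar coefficient print as 0.000 even though they are significant. The data below are simulated with invented coefficients, and the object name `fits` is hypothetical; the sketch only illustrates adding predictors one at a time and comparing fit statistics.

```r
set.seed(7)
n <- 500

# Simulated data (NOT the real wines); chlorides is unrelated to density here
ww <- data.frame(
  residual.sugar = runif(n, 1, 20),
  alcohol        = runif(n, 8, 14),
  chlorides      = runif(n, 0.02, 0.08)
)
ww$density <- 0.99 + 0.0004 * ww$residual.sugar - 0.001 * ww$alcohol +
  rnorm(n, sd = 0.0005)

# Nested models, adding one predictor at a time
m1 <- lm(density ~ residual.sugar, data = ww)
m2 <- update(m1, . ~ . + alcohol)
m3 <- update(m2, . ~ . + chlorides)

# R-squared cannot decrease as predictors are added;
# adjusted R-squared and AIC penalise the extra terms
fits <- sapply(list(m1 = m1, m2 = m2, m3 = m3), function(m) {
  s <- summary(m)
  c(r2 = s$r.squared, adj.r2 = s$adj.r.squared, AIC = AIC(m))
})
print(round(fits, 4))
```

Printing more decimals than the table above makes it easier to see whether a newly added predictor changes the fit at all.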
The full model has an R-squared of 0.9, so it explains about 90% of the variance in density. Curiously, at the one-decimal precision shown, R-squared stays the same from the second model onward; the lowest value, 0.7, belongs to the model with residual.sugar alone. The other variables add little, which is understandable since only 2 variables (residual sugar and alcohol) really determine density.
The next model predicts quality from the characteristics we observed earlier.
##
## Calls:
## mq1: lm(formula = quality ~ density, data = ww)
## mq2: lm(formula = quality ~ density + total.acidity, data = ww)
## mq3: lm(formula = quality ~ density + total.acidity + total.sulfur.dioxide,
## data = ww)
## mq4: lm(formula = quality ~ density + total.acidity + total.sulfur.dioxide +
## residual.sugar, data = ww)
## mq5: lm(formula = quality ~ density + total.acidity + total.sulfur.dioxide +
## residual.sugar + alcohol, data = ww)
##
## =====================================================================================
## mq1 mq2 mq3 mq4 mq5
## -------------------------------------------------------------------------------------
## (Intercept) 96.277*** 92.551*** 89.589*** 237.150*** 66.530***
## (4.003) (4.132) (4.827) (8.008) (14.584)
## density -90.942*** -86.815*** -83.775*** -233.530*** -63.846***
## (4.027) (4.185) (4.906) (8.131) (14.593)
## total.acidity -0.050*** -0.051*** 0.026 -0.055***
## (0.014) (0.014) (0.014) (0.015)
## total.sulfur.dioxide -0.000 0.000 0.000
## (0.000) (0.000) (0.000)
## residual.sugar 0.097*** 0.045***
## (0.004) (0.006)
## alcohol 0.275***
## (0.020)
## -------------------------------------------------------------------------------------
## R-squared 0.1 0.1 0.1 0.2 0.2
## adj. R-squared 0.1 0.1 0.1 0.2 0.2
## sigma 0.8 0.8 0.8 0.8 0.8
## F 509.9 262.0 175.1 271.5 264.3
## p 0.0 0.0 0.0 0.0 0.0
## Log-likelihood -6112.0 -6105.6 -6104.9 -5863.7 -5769.1
## Deviance 3478.7 3469.6 3468.6 3143.4 3024.2
## AIC 12230.0 12219.2 12219.8 11739.5 11552.1
## BIC 12249.5 12245.2 12252.2 11778.5 11597.6
## N 4898 4898 4898 4898 4898
## =====================================================================================
The results of this model are not as impressive as those of the previous one. R-squared equals 0.2, so the model explains only about 20% of the variance in quality. The dataset consists of 4898 observations and most wines are rated ‘below average’ or ‘average’. There are very few ‘very good’ wines, so it’s hard to predict what characteristics a high quality wine will have. A bigger dataset and a more formal quality measure would probably improve the results.
Multivariate plotting mostly made the bivariate analysis clearer. Better wines tend to have a higher level of alcohol and a smaller range of acidity. I also found that the SO2 level is lower for bad wines than for the others.
I’ve created 2 different linear models. The first, for density, explained a very high share (90%) of the variance. That’s because density is a composite characteristic that can be expressed through the others. The second model was far less impressive, so quality can’t be predicted that easily from it.
Interestingly, no strong dependency was found between any of the characteristics and quality. Across all the plots and hypotheses, there are no strong correlations with the quality variable. That may be because the quality rating is only the experts’ subjective opinion of a wine, or because the dataset is not very large. There is also no strict definition of what ‘quality’ is. Maybe it’s the taste, the smell, the color, or many different features that can’t be described by just a few quantitative variables.
Density is one of the characteristics that differs between wines of different quality. The median decreases for better quality, and the better the wine, the smaller the range of density. For “very good” wines, 50% of values lie between 0.9917 and 0.9961, with a median of 0.9937. So the winemakers kept the density, which reflects the balance of residual sugar and alcohol, in that range. But there is an outlier to this rule.
Density tends to increase with acidity; for most wines there is a linear relationship between the two variables. To avoid overplotting I’ve used the geom_smooth function to show the trend, and most values for high quality wines are placed below this line. Acidity can vary, but density is lower for better wines. It’s still possible to find high quality wines with high acidity and density, but there are not many of them.
All values are separated into buckets according to wine rate. The ‘average’ bucket contains most of the values, and it’s obvious how few values fall in the ‘very good’ bucket. In terms of alcohol level, there are many more strong wines in the last 2 buckets (more yellow dots), which supports the hypothesis that better wines are stronger. The acidity range looks much the same for all types; the values are spread evenly along the acidity scale, so it’s hard to say what acidity level corresponds to higher quality.
The dataset contains 4898 observations with 11 variables describing the chemical characteristics of white wines, plus an experts’ rating for each observation.
During the analysis, several characteristics were found to influence quality: alcohol, acidity and SO2 level. Even so, the linear model couldn’t explain most of the variance. That may be down to how quality is defined: since it is a subjective expert rating, the spread of chemical characteristics at each quality level may reflect personal taste. Still, the analysis did show that better quality wines have a higher level of alcohol.
Even though there are almost 5000 observations in the dataset, there are very few high quality wines (according to the experts’ ratings), and the same goes for bad wines. The rating scale runs from 1 to 10, but only 5 wines are rated 9 and none are rated 1, 2 or 10. Expanding the dataset would probably improve the current linear model’s ability to predict quality. A more formal definition of quality would also help.
It would be more interesting to investigate a dataset that includes wine prices and manufacturers. With these values and the current chemical characteristics, one could build models for predicting price or for finding the characteristics specific to a brand.
The following materials were used during analysis: